This notebook analyzes the scripts of the original Star Wars trilogy using several methods of text analysis.
## loading packages used in project.
## setting conflict preferences.
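The chunk that loads packages is not shown in this rendering. Judging from the functions called below, a setup along these lines is presumably needed. The package list is an inference from those calls, not the original chunk, and `theme_phil()` looks like a personal ggplot2 theme that is not covered here.

```r
# Packages inferred from the functions used in this notebook (an assumption,
# not the original setup chunk).
pkgs <- c("data.table",  # fread()
          "dplyr",       # mutate(), count(), slice_max(), ...
          "tidyr",       # pivot_wider()
          "ggplot2",     # plotting
          "tidytext",    # unnest_tokens(), get_sentiments(), reorder_within()
          "ggthemes",    # scale_fill_gradient2_tableau(), scale_fill_colorblind()
          "ggrepel",     # geom_label_repel()
          "flextable")   # flextable()

# Check which ones are installed without attaching anything.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All inferred packages are available.")
}
```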
Load Star Wars data.
ep4<-fread("data/episode_iv.txt") %>%
mutate(document = "a new hope") %>%
mutate(line_number = row_number()) %>%
rename(character = Character,
dialogue = Dialogue) %>%
select(document, line_number, character, dialogue) %>%
as_tibble()
ep4 %>%
mutate(line_number = row_number()) %>%
select(line_number, character, dialogue)
We’ll tokenize this using tidytext. This enables us to display word counts, with and without stop words.
ep4 %>%
unnest_tokens(word, dialogue) %>%
group_by(document) %>%
count(word, sort = TRUE) %>%
slice_max(n, n=25) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word)) +
geom_col() +
labs(y = NULL)+
theme_phil()+
theme(axis.text.y=element_text(size=8))+
facet_wrap(document ~.)
data(stop_words)
ep4 %>%
unnest_tokens(word, dialogue) %>%
anti_join(stop_words, by="word") %>%
group_by(document) %>%
count(word, sort = TRUE) %>%
slice_max(n, n=25) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word)) +
geom_col() +
labs(y = NULL)+
theme_phil()+
theme(axis.text.y=element_text(size=8))+
facet_wrap(document ~.)
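To make the two steps above concrete, here is a dependency-free sketch in base R of what tokenizing and dropping stop words does to a single line. The tiny stop-word vector is made up for illustration; tidytext's `stop_words` table is far larger, and `unnest_tokens()` treats punctuation more carefully than this regular expression.

```r
line <- "Help me, Obi-Wan Kenobi, you're my only hope."

# Lowercase, then split on anything that is not a letter or an apostrophe.
tokens <- strsplit(tolower(line), "[^a-z']+")[[1]]
tokens <- tokens[tokens != ""]

# Toy stop-word list (hypothetical; stands in for tidytext::stop_words).
toy_stop_words <- c("me", "my", "you're", "only")

# Equivalent of anti_join(stop_words, by = "word").
kept <- tokens[!tokens %in% toy_stop_words]

# Equivalent of count(word, sort = TRUE).
word_counts_toy <- sort(table(kept), decreasing = TRUE)
print(word_counts_toy)
```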
We can similarly do this for each main character.
word_counts<- ep4 %>%
unnest_tokens(word, dialogue) %>%
#anti_join(stop_words, by="word") %>%
group_by(document, character) %>%
count(word) %>%
group_by(document, character) %>%
summarize(words = sum(n),
.groups = "drop") %>%
arrange(desc(words)) %>%
filter(words > 25)
word_counts %>%
ggplot(., aes(y=reorder(tolower(character), words), x = words))+
geom_col()+
theme_phil()+
ylab("character")+
xlab("words spoken")+
facet_wrap(document ~.)
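What the chunk above computes is the number of tokens each character speaks. A base R sketch on three illustrative lines typed by hand (not read from the data file), with a plain whitespace split standing in for `unnest_tokens()`:

```r
# Hypothetical mini-script for illustration only.
toy_script <- data.frame(
  character = c("LUKE", "HAN", "LUKE"),
  dialogue  = c("But I was going into Tosche Station",
                "Never tell me the odds",
                "I care"),
  stringsAsFactors = FALSE
)

# Number of whitespace-separated tokens in each line.
toy_script$n_words <- lengths(strsplit(toy_script$dialogue, "\\s+"))

# Total words per character, largest first.
words_spoken <- sort(tapply(toy_script$n_words, toy_script$character, sum),
                     decreasing = TRUE)
print(words_spoken)
```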
# show how we get here
word_runningcount<-ep4 %>%
unnest_tokens(word, dialogue) %>%
mutate(character = case_when(character == 'AUNT BERU' ~ 'BERU',
TRUE ~ character)) %>%
group_by(document, character) %>%
mutate(main_character = case_when(character %in% (word_counts %>%
slice_max(words, n=10) %>%
pull(character))
~ tolower(character),
TRUE ~ "other")) %>%
mutate(count = 1,
running_count = cumsum(count),
final = case_when(running_count == max(running_count) ~ character,
TRUE ~ NA_character_))
library(randomcoloR)
n <- word_runningcount %>% group_by(document) %>% summarize(characters = n_distinct(character)) %>%
pull(characters)
set.seed(1)
palette <- distinctColorPalette(n)
# line chart
word_runningcount %>%
ungroup() %>%
mutate(word_number = row_number()) %>%
ggplot(., aes(x=line_number,
y=running_count,
label = final,
color = character,
group=character,
by = main_character))+
geom_line(lwd=1.5)+
theme_phil()+
geom_label_repel(nudge_x =1,
nudge_y = 0.25)+
guides(label = "none",
color = "none")+
facet_wrap(document ~.)+
scale_color_manual(values = palette)+
labs(y="Word Count",
x ="Movie Line")
# count by character
ep4 %>%
unnest_tokens(word, dialogue) %>%
anti_join(stop_words,
by = "word") %>%
mutate(character = case_when(character == 'AUNT BERU' ~ 'BERU',
TRUE ~ character)) %>%
group_by(document, character) %>%
count(word, sort=T) %>%
filter(character %in% (word_counts %>% slice_max(words, n=12) %>% pull(character))) %>%
group_by(document, character) %>%
mutate(rank = row_number()) %>%
filter(rank<=10) %>%
ungroup() %>%
mutate(word = reorder_within(word, n, character)) %>%
mutate(character = reorder(character, desc(n))) %>%
ggplot(., aes(x=n, y=word, fill = character))+
geom_col(show.legend=F)+
scale_y_reordered()+
facet_wrap(~character, ncol=3, scales="free_y")+
theme_phil()+
scale_fill_manual(values = distinctColorPalette(12))+
theme(axis.text.y = element_text(size=8))
We can now use sentiment analysis to classify words as positive or negative and then tally them by character, by line, and so on. We’ll use the “bing” lexicon to start.
get_sentiments("bing") %>%
sample_n(10)
# get sentiment
ep4_sentiment<-
ep4 %>%
unnest_tokens(word, dialogue) %>%
anti_join(stop_words, by="word") %>%
mutate(character = case_when(character == 'AUNT BERU' ~ 'BERU',
TRUE ~ character)) %>%
inner_join(get_sentiments("bing"))
## Joining, by = "word"
# # plot line
# ep4_sentiment %>%
# count(document, index = line_number %/% 1, sentiment) %>%
# pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
# mutate(sentiment = positive - negative) %>%
# ggplot(., aes(x=index, y=sentiment, sentiment))+
# geom_line()+
# theme_phil()+
# facet_wrap(document~.)
# plot bar
ep4_sentiment %>%
count(document, index = line_number %/% 8, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative) %>%
ggplot(., aes(x=index, y=sentiment, fill=sentiment))+
geom_col()+
theme_phil()+
facet_wrap(document~.)+
scale_fill_gradient2_tableau(limits=c(-4, 2), oob = scales::squish)+
theme(legend.title = element_text(size=8))+
guides(fill = guide_colourbar(title = "sentiment",
title.position = "top",
barwidth=8,
barheight=0.5))
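The bar chart rests on two small ideas: `line_number %/% 8` assigns each run of consecutive lines to one bin, and the score of a bin is its count of positive words minus its count of negative words. A base R sketch with made-up matches (the data frame below is invented, not taken from the script):

```r
# Hypothetical word-level matches: one row per word found in the lexicon.
matches <- data.frame(
  line_number = c(1, 3, 7, 8, 9, 15, 16, 17),
  sentiment   = c("negative", "negative", "positive", "positive",
                  "negative", "negative", "positive", "positive"),
  stringsAsFactors = FALSE
)

# Integer division: lines 0-7 fall in bin 0, lines 8-15 in bin 1, and so on.
matches$index <- matches$line_number %/% 8

# Cross-tabulate bins against sentiment, then take positive minus negative.
tab <- table(matches$index, matches$sentiment)
net <- tab[, "positive"] - tab[, "negative"]
print(net)
```

Because line numbers start at 1, the first bin covers only lines 1 to 7, one fewer than the rest.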
What are the most positive and negative sequences?
# most negative
ep4_sentiment %>%
count(document, line_number, index = line_number %/% 8, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative) %>%
slice_min(sentiment, n=5) %>%
inner_join(., ep4) %>%
select(document, character, line_number, sentiment, dialogue) %>%
flextable() %>%
flextable::autofit()
## Joining, by = c("document", "line_number")
document | character | line_number | sentiment | dialogue |
--- | --- | --- | --- | --- |
a new hope | LEIA | 270 | -6 | General Kenobi, years ago you served my father in the Clone Wars. Now he begs you to help him in his struggle against the Empire. I regret that I am unable to present my father's request to you in person, but my ship has fallen under attack and I'm afraid my mission to bring you to Alderaan has failed. I have placed information vital to the survival of the Rebellion into the memory systems of this R2 unit. My father will know how to retrieve it. You must see this droid safely delivered to him on Alderaan. This is our most desperate hour. Help me, Obi-Wan Kenobi, you're my only hope. |
a new hope | HAN | 607 | -5 | Uh, uh, negative, negative. We had a reactor leak here now. Give us a few minutes to lock it down. Large leak... very dangerous. |
a new hope | BEN | 441 | -4 | I felt a great disturbance in the Force... as if millions of voices suddenly cried out in terror and were suddenly silenced. I fear something terrible has happened. |
a new hope | THREEPIO | 24 | -3 | That's funny, the damage doesn't look as bad from out here. |
a new hope | THREEPIO | 128 | -3 | Thank the maker! This oil bath is going to feel so good. I've got such a bad case of dust contamination, I can barely move! |
a new hope | BEN | 529 | -3 | Who's the more foolish... the fool or the fool who follows him? |
# most positive
ep4_sentiment %>%
count(document, line_number, index = line_number %/% 8, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative) %>%
slice_max(sentiment, n=5) %>%
inner_join(., ep4) %>%
select(document, character, line_number, sentiment, dialogue) %>%
flextable() %>%
flextable::autofit()
## Joining, by = c("document", "line_number")
document | character | line_number | sentiment | dialogue |
--- | --- | --- | --- | --- |
a new hope | THREEPIO | 222 | 3 | Master Luke here is your rightful owner. We'll have no more of this Obi-Wan Kenobi jibberish... and don't talk to me of your mission, either. You're fortunate he doesn't blast you into a million pieces right here. |
a new hope | LUKE | 574 | 3 | Yes. Rich, powerful! Listen, if you were to rescue her, the reward would be... |
a new hope | HAN | 605 | 3 | Uh... had a slight weapons malfunction. But, uh, everything's perfectly all right now. We're fine. We're all fine here, now, thank you. How are you? |
a new hope | LUKE | 76 | 2 | I'm sorry. I'm quiet. Listen how quiet I am. You can barely hear me... |
a new hope | BEN | 233 | 2 | Rest easy, son, you've had a busy day. You're fortunate you're still in one piece. |
a new hope | TAGGE | 282 | 2 | The Rebellion will continue to gain support in the Imperial Senate as long as.... |
a new hope | BEN | 298 | 2 | And these blast points, too accurate for Sand People. Only Imperial stormtroopers are so precise. |
a new hope | TARKIN | 435 | 2 | There. You see Lord Vader, she can be reasonable. Continue with the operation. You may fire when ready. |
a new hope | THREEPIO | 588 | 2 | Master Luke, sir! Pardon me for asking... but, ah... what should Artoo and I do if we're discovered here? |
a new hope | HAN | 687 | 2 | One thing's for sure. We're all going to be a lot thinner!Get on top of it! |
a new hope | HAN | 712 | 2 | No reward is worth this. |
a new hope | HAN | 783 | 2 | Easy... you call that easy? |
Hmm. The most negative lines look alright, but the most positive ones are rather off. We’ll see how other approaches compare next, but let’s also check on characters.
Who are the most positive and negative characters?
# positive and negative characters
ep4_sentiment %>%
group_by(document, character) %>%
count(document, character, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative) %>%
arrange(sentiment) %>%
ggplot(., aes(y=reorder(character, sentiment), x=sentiment, fill = sentiment))+
geom_col()+
theme_phil()+
scale_fill_gradient2_tableau(limits=c(-5, 2), oob = scales::squish)+
theme(legend.title = element_text(size=8))+
guides(fill = guide_colourbar(title = "sentiment",
title.position = "top",
barwidth=8,
barheight=0.5))+
ylab("character")+
xlab("sentiment score")
Well, Luke is rather whiny, so…
Let’s try AFINN to see how its lexicon compares.
set.seed(10)
get_sentiments("afinn") %>%
sample_n(10)
Repeat the analysis from before.
library(textdata)
ep4 %>%
unnest_tokens(word, dialogue) %>%
anti_join(stop_words, by="word") %>%
mutate(character = case_when(character == 'AUNT BERU' ~ 'BERU',
TRUE ~ character)) %>%
inner_join(get_sentiments("afinn")) %>%
count(document, line_number, index = line_number %/% 8, value) %>%
group_by(document, index) %>%
summarize(sentiment = sum(value)) %>%
# pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
ggplot(., aes(x=index, y=sentiment, fill=sentiment))+
geom_col()+
theme_phil()+
facet_wrap(document~.) +
scale_fill_gradient2_tableau(limits=c(-8, 8), oob = scales::squish)+
theme(legend.title = element_text(size=8))+
guides(fill = guide_colourbar(title = "sentiment",
title.position = "top",
barwidth=8,
barheight=0.5))
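One detail worth checking in the chunk above: `count()` collapses identical `value`s within a line into a single row with a tally `n`, so a following `sum(value)` counts a repeated score only once. Whether that is intended is not stated here. The sketch below only shows the difference, in base R with made-up scores.

```r
# Hypothetical AFINN-style matches for one line: the score 2 occurs three times.
hits <- data.frame(line_number = c(1, 1, 1, 1),
                   value       = c(2, 2, 2, 3))

# Equivalent of count(line_number, value): one row per distinct value.
counted <- aggregate(n ~ line_number + value,
                     data = transform(hits, n = 1), FUN = sum)

unweighted <- sum(counted$value)              # each distinct value once
weighted   <- sum(counted$value * counted$n)  # every matched word counted

print(c(unweighted = unweighted, weighted = weighted))
```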
# check most positive as before
ep4 %>%
unnest_tokens(word, dialogue) %>%
anti_join(stop_words, by="word") %>%
mutate(character = case_when(character == 'AUNT BERU' ~ 'BERU',
TRUE ~ character)) %>%
inner_join(get_sentiments("afinn")) %>%
count(document, line_number, index = line_number %/% 8, value) %>%
group_by(document, line_number) %>%
summarize(sentiment = sum(value)) %>%
slice_max(sentiment, n=5, with_ties = F) %>%
inner_join(., ep4) %>%
select(document, character, line_number, sentiment, dialogue) %>%
flextable() %>%
flextable::autofit()
document | character | line_number | sentiment | dialogue |
--- | --- | --- | --- | --- |
a new hope | TARKIN | 924 | 6 | Evacuate? In out moment of triumph? I think you overestimate their chances! |
a new hope | GREEDO | 366 | 5 | It's too late. You should have paid him when you had the chance. Jabba's put a price on your head, so large that every bounty hunter in the galaxy will be looking for you. I'm lucky I found you first. |
a new hope | HAN | 605 | 5 | Uh... had a slight weapons malfunction. But, uh, everything's perfectly all right now. We're fine. We're all fine here, now, thank you. How are you? |
a new hope | BEN | 746 | 5 | You can't win, Darth. If you strike me down, I shall become more powerful than you can possibly imagine. |
a new hope | HAN | 416 | 4 | Here's where the fun begins! |
# check most negative
ep4 %>%
unnest_tokens(word, dialogue) %>%
anti_join(stop_words, by="word") %>%
mutate(character = case_when(character == 'AUNT BERU' ~ 'BERU',
TRUE ~ character)) %>%
inner_join(get_sentiments("afinn")) %>%
count(document, line_number, index = line_number %/% 8, value) %>%
group_by(document, line_number) %>%
summarize(sentiment = sum(value)) %>%
slice_min(sentiment, n=5, with_ties = F) %>%
inner_join(., ep4) %>%
select(document, character, line_number, sentiment, dialogue) %>%
flextable() %>%
flextable::autofit()
document | character | line_number | sentiment | dialogue |
--- | --- | --- | --- | --- |
a new hope | BEN | 257 | -6 | I have something here for you. Your father wanted you to have this when you were old enough, but your uncle wouldn't allow it. He feared you might follow old Obi-Wan on some damned-fool idealistic crusade like your father did. |
a new hope | HAN | 528 | -6 | Damn fool. I knew that you were going to say that! |
a new hope | VADER | 49 | -5 | Leave that to me. Send a distress signal and then inform the senate that all aboard were killed! |
a new hope | OWEN | 214 | -5 | Well, he'd better have those units in the south range repaired bemidday or there'll be hell to pay! |
a new hope | BEN | 441 | -5 | I felt a great disturbance in the Force... as if millions of voices suddenly cried out in terror and were suddenly silenced. I fear something terrible has happened. |
# and characters
ep4 %>%
unnest_tokens(word, dialogue) %>%
anti_join(stop_words, by="word") %>%
mutate(character = case_when(character == 'AUNT BERU' ~ 'BERU',
TRUE ~ character)) %>%
inner_join(get_sentiments("afinn")) %>%
count(document, character, line_number, index = line_number %/% 8, value) %>%
group_by(document, character) %>%
summarize(sentiment = sum(value)) %>%
ggplot(., aes(y=reorder(character, sentiment), x=sentiment, fill = sentiment))+
geom_col()+
theme_phil()+
scale_fill_gradient2_tableau(limits=c(-5, 2), oob = scales::squish)+
theme(legend.title = element_text(size=8))+
guides(fill = guide_colourbar(title = "sentiment",
title.position = "top",
barwidth=8,
barheight=0.5))+
ylab("character")+
xlab("sentiment score")+
facet_wrap(document~.)
We’re running into one of the limits of these lexicon-based sentiment approaches: they score each word in isolation, so they struggle to capture a word’s context within a sentence.
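A toy illustration of that limitation, in base R with a two-word made-up lexicon (not bing or AFINN): a word-by-word lookup scores "not good" as positive because the negator never enters the calculation.

```r
# Hypothetical mini-lexicon: word -> score.
toy_lexicon <- c(good = 1, bad = -1)

score_words <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z']+")[[1]]
  # Words missing from the lexicon come back as NA and are ignored.
  sum(toy_lexicon[tokens], na.rm = TRUE)
}

plain   <- score_words("This is good.")
negated <- score_words("This is not good.")

# Both sentences receive the same score: "not" is invisible to the lookup.
print(c(plain = plain, negated = negated))
```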
We can use sentimentr to illustrate the algorithmic approach.
library(sentimentr)
library(magrittr)
This breaks text down into sentences and then assesses sentiment for each sentence. For example, we can look at Princess Leia’s recorded speech.
ep4 %>%
filter(line_number == 270) %>%
pull(dialogue)
## [1] "General Kenobi, years ago you served my father in the Clone Wars. Now he begs you to help him in his struggle against the Empire. I regret that I am unable to present my father's request to you in person, but my ship has fallen under attack and I'm afraid my mission to bring you to Alderaan has failed. I have placed information vital to the survival of the Rebellion into the memory systems of this R2 unit. My father will know how to retrieve it. You must see this droid safely delivered to him on Alderaan. This is our most desperate hour. Help me, Obi-Wan Kenobi, you're my only hope."
ep4 %>%
filter(line_number == 270) %>%
get_sentences() %>%
sentiment_by(by = c('character', 'dialogue')) %>%
flextable() %>%
flextable::autofit()
character | dialogue | word_count | sd | ave_sentiment |
--- | --- | --- | --- | --- |
LEIA | General Kenobi, years ago you served my father in the Clone Wars. | 12 | | -0.17320508 |
LEIA | Help me, Obi-Wan Kenobi, you're my only hope. | 9 | | 0.03333333 |
LEIA | I have placed information vital to the survival of the Rebellion into the memory systems of this R2 unit. | 19 | | 0.18353259 |
LEIA | I regret that I am unable to present my father's request to you in person, but my ship has fallen under attack and I'm afraid my mission to bring you to Alderaan has failed. | 34 | | -1.13617813 |
LEIA | My father will know how to retrieve it. | 8 | | 0.00000000 |
LEIA | Now he begs you to help him in his struggle against the Empire. | 13 | | -0.20801257 |
LEIA | This is our most desperate hour. | 6 | | -0.55113519 |
LEIA | You must see this droid safely delivered to him on Alderaan. | 11 | | 0.15075567 |
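The first step behind a table like this is splitting a line into sentences so that each can be scored on its own. The regular expression below is a crude stand-in for `get_sentences()`, which handles abbreviations and other edge cases this sketch ignores, and the whitespace word count may differ from sentimentr's own `word_count`.

```r
speech <- paste(
  "My father will know how to retrieve it.",
  "This is our most desperate hour.",
  "Help me, Obi-Wan Kenobi, you're my only hope."
)

# Split on whitespace that follows ., ! or ? (lookbehind needs perl = TRUE).
sentences <- strsplit(speech, "(?<=[.!?])\\s+", perl = TRUE)[[1]]

# Simple whitespace-based word count per sentence.
n_words <- lengths(strsplit(sentences, "\\s+"))

print(data.frame(sentence = sentences, n_words = n_words))
```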
ep4 %>%
filter(line_number == 270) %>%
mutate(sentences = get_sentences(dialogue)) %$%
sentiment_by(sentences, list(document, character, line_number)) %>%
sentimentr::highlight()
ep4 %>%
filter(line_number == 803) %>%
mutate(sentences = get_sentences(dialogue)) %$%
sentiment_by(sentences, list(document, character, line_number)) %>%
sentimentr::highlight()
We can apply this to the entirety of the movie…
sentimentr_scores<-
ep4 %>%
get_sentences() %>%
sentiment_by(by = c('character', 'line_number'))
# plot by line number
sentimentr_scores %>%
arrange(line_number) %>%
mutate(index = line_number %/% 8) %>%
ggplot(., aes(x=line_number, y=ave_sentiment, fill = ave_sentiment))+
geom_col()+
theme_phil()+
scale_fill_gradient2_tableau(limits=c(-1, 1), oob = scales::squish)+
guides(fill = guide_colourbar(title = "sentiment",
title.position = "top",
barwidth=8,
barheight=0.5))
# summarize it a bit
sentimentr_scores %>%
arrange(line_number) %>%
mutate(index = line_number %/% 8) %>%
group_by(index) %>%
summarize(ave_sentiment = sum(ave_sentiment)) %>%
ggplot(., aes(x=index, y=ave_sentiment, fill = ave_sentiment))+
geom_col()+
theme_phil()+
scale_fill_gradient2_tableau(limits=c(-3, 3), oob = scales::squish)+
guides(fill = guide_colourbar(title = "sentiment",
title.position = "top",
barwidth=8,
barheight=0.5))
Plot the distribution of per-line sentiment for each character with more than 100 words.
sentimentr_scores %>%
group_by(character) %>%
mutate(total_word_count = sum(word_count)) %>%
filter(total_word_count > 100) %>%
filter(ave_sentiment !=0) %>%
ggplot(., aes(y=reorder(character, total_word_count),
color = ave_sentiment,
x=ave_sentiment))+
geom_boxplot(alpha=0.1)+
geom_jitter(height=0.1, width=0)+
theme_phil()+
scale_color_gradient2_tableau(limits=c(-0.75, 0.75), oob = scales::squish)+
#theme(legend.title = element_text(size=8))+
guides(color = guide_colourbar(title = "average sentiment",
title.position = "top",
barwidth=8,
barheight=0.5))+
ylab("character")+
xlab("average sentiment")+
theme(panel.grid.major = element_blank())+
geom_vline(xintercept = 0)
Well we gotta explore this. Does Luke get less whiny throughout the movie?
# cumulative sentiment
ep4 %>%
filter(character== 'LUKE') %>%
get_sentences() %>%
sentiment_by(by = c('document', 'character', 'line_number', 'dialogue')) %>%
arrange(line_number) %>%
mutate(row_number = row_number()) %>%
filter(row_number > 65) %>%
mutate(dialogue = as.character(dialogue)) %>%
mutate(show_negative = case_when(abs(ave_sentiment) > .35 ~ dialogue,
TRUE ~ "")) %>%
mutate(run_sentiment = cumsum(ave_sentiment)) %>%
ggplot(., aes(x=row_number, y=run_sentiment, label = show_negative))+
geom_line()+
geom_label_repel(size=2.5, max.overlaps=50)+
theme_phil()+
facet_wrap(document+character~.)+
xlab("line_number")+
ylab("running total of sentiment")
## Warning: ggrepel: 3 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
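The running total in this plot is a cumulative sum of per-line scores in script order, so a downward slope marks a run of negative lines. A minimal base R sketch with invented scores:

```r
# Hypothetical per-line average sentiment for one character.
scores <- data.frame(
  line_number   = c(12, 30, 41, 58, 77),
  ave_sentiment = c(-0.4, -0.1, 0.3, 0.5, -0.2)
)

# Order by position in the script, then accumulate.
scores <- scores[order(scores$line_number), ]
scores$run_sentiment <- cumsum(scores$ave_sentiment)

# Flag only the strongly scored lines, as the plot does with abs() > .35.
scores$label <- ifelse(abs(scores$ave_sentiment) > 0.35, "labelled", "")
print(scores)
```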
# average sentiment
ep4 %>%
filter(character== 'LUKE') %>%
get_sentences() %>%
sentiment_by(by = c('document', 'character', 'line_number', 'dialogue')) %>%
arrange(line_number) %>%
mutate(row_number = row_number()) %>%
#filter(row_number > 65) %>%
filter(word_count > 2) %>%
filter(ave_sentiment != 0) %>%
mutate(dialogue = as.character(dialogue)) %>%
mutate(show_negative = case_when(abs(ave_sentiment) > .3 ~ dialogue,
TRUE ~ "")) %>%
mutate(run_sentiment = cumsum(ave_sentiment)) %>%
ggplot(., aes(x=row_number, y=ave_sentiment, label = show_negative))+
geom_point()+
geom_line(linetype = 'dotted', lwd=0.5)+
geom_label_repel(size=2.5, max.overlaps=50)+
theme_phil()+
facet_wrap(document+character~.)+
ylab("average sentiment")+
xlab("line_number")+
geom_smooth(method = 'loess', formula = 'y ~ x')
Alright this is too much fun. Let’s look at some other characters.
# average sentiment
ep4 %>%
filter(character== 'HAN') %>%
get_sentences() %>%
sentiment_by(by = c('document', 'character', 'line_number', 'dialogue')) %>%
arrange(line_number) %>%
filter(word_count > 2) %>%
mutate(row_number = row_number()) %>%
#filter(row_number > 65) %>%
filter(ave_sentiment != 0) %>%
mutate(dialogue = as.character(dialogue)) %>%
mutate(show_negative = case_when(abs(ave_sentiment) > .3 ~ dialogue,
TRUE ~ "")) %>%
mutate(run_sentiment = cumsum(ave_sentiment)) %>%
ggplot(., aes(x=row_number, y=ave_sentiment, label = show_negative))+
geom_point()+
geom_line(linetype = 'dotted', lwd=0.5)+
geom_label_repel(size=2, max.overlaps=50)+
theme_phil()+
facet_wrap(document+character~., ncol=1)+
ylab("average sentiment")+
xlab("line_number")+
geom_smooth(method = 'loess', formula = 'y ~ x')
# average sentiment
ep4 %>%
filter(character== 'THREEPIO') %>%
get_sentences() %>%
sentiment_by(by = c('document', 'character', 'line_number', 'dialogue')) %>%
arrange(line_number) %>%
filter(word_count > 2) %>%
mutate(row_number = row_number()) %>%
#filter(row_number > 65) %>%
filter(ave_sentiment != 0) %>%
mutate(dialogue = as.character(dialogue)) %>%
mutate(show_negative = case_when(abs(ave_sentiment) > .3 ~ dialogue,
TRUE ~ "")) %>%
mutate(run_sentiment = cumsum(ave_sentiment)) %>%
ggplot(., aes(x=row_number, y=ave_sentiment, label = show_negative))+
geom_point()+
geom_line(linetype = 'dotted', lwd=0.5)+
geom_label_repel(size=2.5, max.overlaps=50)+
theme_phil()+
facet_wrap(document+character~., ncol=1)+
ylab("average sentiment")+
xlab("line_number")+
geom_smooth(method = 'loess', formula = 'y ~ x')
# most positive
ep4 %>%
get_sentences() %>%
sentiment_by(by = c('document', 'character', 'line_number', 'dialogue')) %>%
arrange(line_number) %>%
group_by(line_number) %>%
summarize(sum_sentiment = sum(ave_sentiment)) %>%
arrange(desc(sum_sentiment)) %>%
slice_max(sum_sentiment, n=5) %>%
inner_join(., ep4) %>%
mutate_if(is.numeric, round, 3) %>%
select(document, character, line_number, sum_sentiment, dialogue) %>%
flextable() %>%
flextable::autofit()
## Joining, by = "line_number"
document | character | line_number | sum_sentiment | dialogue |
--- | --- | --- | --- | --- |
a new hope | HAN | 581 | 1.823 | All right, kid. But you'd better be right about this! |
a new hope | LUKE | 574 | 1.669 | Yes. Rich, powerful! Listen, if you were to rescue her, the reward would be... |
a new hope | THREEPIO | 125 | 1.600 | Uh, I'm quite sure you'll be very pleased with that one, sir. He really is in first-class condition. I've worked with him before. Here he comes. |
a new hope | LUKE | 184 | 1.592 | Yes, sir. I think those new droids are going to work out fine. In fact, I, uh, was also thinking about our agreement about my staying on another season. And if these new droids do work out, I want to transmit my application to the Academy this year. |
a new hope | TARKIN | 437 | 1.574 | You're far too trusting. Dantooine is too remote to make an effective demonstration. But don't worry. We will deal with your Rebel friends soon enough. |
# most negative
ep4 %>%
get_sentences() %>%
sentiment_by(by = c('document', 'character', 'line_number', 'dialogue')) %>%
arrange(line_number) %>%
group_by(line_number) %>%
summarize(sum_sentiment = sum(ave_sentiment)) %>%
arrange(desc(sum_sentiment)) %>%
slice_min(sum_sentiment, n=5) %>%
inner_join(., ep4) %>%
mutate_if(is.numeric, round, 3) %>%
select(document, character, line_number, sum_sentiment, dialogue) %>%
flextable() %>%
flextable::autofit()
## Joining, by = "line_number"
document | character | line_number | sum_sentiment | dialogue |
--- | --- | --- | --- | --- |
a new hope | HAN | 607 | -2.200 | Uh, uh, negative, negative. We had a reactor leak here now. Give us a few minutes to lock it down. Large leak... very dangerous. |
a new hope | BEN | 264 | -1.710 | A young Jedi named Darth Vader, who was a pupil of mine until he turned to evil, helped the Empire hunt down and destroy the Jedi Knights. He betrayed and murdered your father. Now the Jedi are all but extinct. Vader was seduced by the dark side of the Force. |
a new hope | LEIA | 270 | -1.701 | General Kenobi, years ago you served my father in the Clone Wars. Now he begs you to help him in his struggle against the Empire. I regret that I am unable to present my father's request to you in person, but my ship has fallen under attack and I'm afraid my mission to bring you to Alderaan has failed. I have placed information vital to the survival of the Rebellion into the memory systems of this R2 unit. My father will know how to retrieve it. You must see this droid safely delivered to him on Alderaan. This is our most desperate hour. Help me, Obi-Wan Kenobi, you're my only hope. |
a new hope | BEN | 441 | -1.447 | I felt a great disturbance in the Force... as if millions of voices suddenly cried out in terror and were suddenly silenced. I fear something terrible has happened. |
a new hope | THREEPIO | 1 | -1.426 | Did you hear that? They've shut down the main reactor. We'll be destroyed for sure. This is madness! |
Let’s plot the cumulative sentiment for each movie and see how they compare.
# by point
ep4 %>%
get_sentences() %>%
sentiment_by(by = c('document', 'character', 'line_number', 'dialogue')) %>%
filter(word_count > 2) %>%
arrange(line_number) %>%
group_by(line_number) %>%
mutate(dialogue = as.character(dialogue)) %>%
summarize(sum_sentiment = sum(ave_sentiment)) %>%
inner_join(., ep4) %>%
mutate(running_sentiment = cumsum(sum_sentiment)) %>%
mutate(row_number = row_number()) %>%
mutate(show_negative = case_when(abs(sum_sentiment) > .6 ~ dialogue,
TRUE ~ "")) %>%
ggplot(., aes(x=row_number, y=running_sentiment, label = show_negative))+
geom_point(size=1)+
geom_line(linetype = 'dotted', lwd=0.9)+
geom_label_repel(size=3, max.overlaps=30)+
theme_phil()+
ylab("running total of sentiment")+
xlab("line_number")+
facet_wrap(document~.)
## Joining, by = "line_number"
## Warning: ggrepel: 57 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
### Emotions
Get all emotions
# find some sentences
ep4 %>%
get_sentences() %>%
emotion() %>%
arrange(desc(emotion))
# bar chart - percent of total
ep4 %>%
get_sentences() %>%
emotion() %>%
mutate(emotion_type = as.character(emotion_type)) %>%
filter(emotion_type == 'anger'|
emotion_type == 'disgust' |
emotion_type == 'sadness' |
emotion_type == 'fear' |
emotion_type == 'trust' |
emotion_type == 'anticipation' |
emotion_type == 'joy') %>%
mutate(emotion_type = case_when((emotion_type == 'anger' | emotion_type == 'fear') ~ 'anger & fear',
(emotion_type == 'trust' | emotion_type == 'joy') ~ 'trust & joy',
TRUE ~ emotion_type)) %>%
# mutate(running_emotion = cumsum(emotion)) %>%
mutate(index = line_number %/% 8) %>%
group_by(index, emotion_type) %>%
summarize(sum_emotion = sum(emotion)) %>%
group_by(index) %>%
arrange(sum_emotion) %>%
#group_by(emotion_type) %>%
#mutate(running_sum = cumsum(sum_emotion)) %>%
ggplot(., aes(x=index, y=sum_emotion, fill=emotion_type, color = emotion_type, group=emotion_type))+
geom_col(position = 'fill') +
theme_phil()+
scale_fill_colorblind()+
scale_color_colorblind()
## `summarise()` has grouped output by 'index'. You can override using the `.groups` argument.
# area chart
ep4 %>%
get_sentences() %>%
emotion() %>%
mutate(emotion_type = as.character(emotion_type)) %>%
filter(emotion_type == 'anger'|
emotion_type == 'disgust' |
emotion_type == 'sadness' |
emotion_type == 'fear' |
emotion_type == 'trust' |
emotion_type == 'anticipation' |
emotion_type == 'joy') %>%
mutate(emotion_type = case_when((emotion_type == 'anger' | emotion_type == 'fear') ~ 'anger & fear',
(emotion_type == 'trust' | emotion_type == 'joy') ~ 'trust & joy',
TRUE ~ emotion_type)) %>%
# mutate(running_emotion = cumsum(emotion)) %>%
mutate(index = line_number %/% 8) %>%
group_by(index, emotion_type) %>%
summarize(sum_emotion = sum(emotion)) %>%
group_by(emotion_type) %>%
mutate(running_sum = cumsum(sum_emotion)) %>%
ggplot(., aes(x=index, y=running_sum, fill=emotion_type, color = emotion_type, group=emotion_type))+
geom_area()+
theme_phil()+
scale_fill_colorblind()+
scale_color_colorblind()
## `summarise()` has grouped output by 'index'. You can override using the `.groups` argument.
# line chart of emotion
ep4 %>%
get_sentences() %>%
emotion() %>%
filter(emotion_type == 'anger'|
emotion_type == 'disgust' |
emotion_type == 'sadness' |
emotion_type == 'fear' |
emotion_type == 'trust' |
emotion_type == 'anticipation' |
emotion_type == 'joy') %>%
mutate(emotion_type = as.character(emotion_type)) %>%
mutate(emotion_type = case_when((emotion_type == 'anger' | emotion_type == 'fear') ~ 'anger & fear',
(emotion_type == 'trust' | emotion_type == 'joy') ~ 'trust & joy',
TRUE ~ emotion_type)) %>%
# mutate(running_emotion = cumsum(emotion)) %>%
mutate(index = line_number %/% 8) %>%
group_by(index, emotion_type) %>%
summarize(sum_emotion = sum(emotion)) %>%
group_by(emotion_type) %>%
mutate(running_sum = cumsum(sum_emotion)) %>%
ggplot(., aes(x=index, y=running_sum, fill=emotion_type, color = emotion_type, group=emotion_type))+
geom_line(lwd=1.1)+
theme_phil()+
scale_fill_colorblind()+
scale_color_colorblind()
## `summarise()` has grouped output by 'index'. You can override using the `.groups` argument.
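The three emotion chunks repeat the same filter and recode. One way to express that step more compactly is a named lookup vector plus `%in%`. The sketch below is base R on invented scores (not real `emotion()` output) and also computes the within-bin shares that `position = 'fill'` draws.

```r
# Hypothetical emotion scores per bin.
emo <- data.frame(
  index        = c(0, 0, 0, 1, 1, 1),
  emotion_type = c("anger", "fear", "joy", "trust", "sadness", "surprise"),
  emotion      = c(2, 1, 1, 3, 1, 5),
  stringsAsFactors = FALSE
)

# Recode map: names are the original types, values are the merged labels.
recode_map <- c(anger = "anger & fear", fear = "anger & fear",
                trust = "trust & joy",  joy  = "trust & joy",
                disgust = "disgust", sadness = "sadness",
                anticipation = "anticipation")

# Keep only the types of interest, then merge them.
emo <- emo[emo$emotion_type %in% names(recode_map), ]
emo$emotion_type <- unname(recode_map[emo$emotion_type])

# Sum per bin and merged type, then convert to shares within each bin.
sums <- aggregate(emotion ~ index + emotion_type, data = emo, FUN = sum)
sums$share <- sums$emotion / ave(sums$emotion, sums$index, FUN = sum)
print(sums[order(sums$index), ])
```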
## Episodes V and VI
ep5<-fread("data/episode_v.txt") %>%
mutate(document = "the empire strikes back") %>%
mutate(line_number = row_number()) %>%
select(document, line_number, character, dialogue) %>%
as_tibble()
ep6<-fread("data/episode_vi.txt") %>%
mutate(document = "return of the jedi") %>%
mutate(line_number = row_number()) %>%
select(document, line_number, character, dialogue) %>%
as_tibble()
## Warning in fread("data/episode_vi.txt"): Found and resolved improper quoting
## out-of-sample. First healed line 191: <<"190" "LEIA" "He means \"You're welcome.
## \"">>. If the fields are not quoted (e.g. field separator does not appear within
## any field), try quote="" to avoid this warning.
starwars <- bind_rows(ep4, ep5, ep6)
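With the three scripts stacked in one table, the per-movie running totals could in principle be computed in a single grouped pass. A base R sketch of a grouped cumulative sum, using invented scores in place of the sentimentr output:

```r
# Hypothetical per-line sentiment for three documents.
toy <- data.frame(
  document      = rep(c("a new hope", "the empire strikes back",
                        "return of the jedi"), each = 3),
  line_number   = rep(1:3, times = 3),
  sum_sentiment = c(0.5, -1.0, 0.2,  -0.3, -0.3, 0.1,  0.4, 0.4, -0.1),
  stringsAsFactors = FALSE
)

# ave() applies cumsum() within each document and keeps the row order.
toy$running_sentiment <- ave(toy$sum_sentiment, toy$document, FUN = cumsum)
print(toy)
```

In dplyr terms this corresponds to `group_by(document)` followed by `mutate(running_sentiment = cumsum(sum_sentiment))`.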
ep4 %>%
get_sentences() %>%
sentiment_by(by = c('document', 'character', 'line_number', 'dialogue')) %>%
filter(word_count > 2) %>%
arrange(line_number) %>%
group_by(line_number) %>%
mutate(dialogue = as.character(dialogue)) %>%
summarize(sum_sentiment = sum(ave_sentiment)) %>%
inner_join(., ep4) %>%
mutate(running_sentiment = cumsum(sum_sentiment)) %>%
mutate(row_number = row_number()) %>%
mutate(show_negative = case_when(abs(sum_sentiment) > .6 ~ dialogue,
TRUE ~ "")) %>%
ggplot(., aes(x=row_number, y=running_sentiment, label = show_negative))+
geom_point(size=0.5)+
geom_line(linetype = 'dotted', lwd=0.5)+
geom_label_repel(size=1.5, max.overlaps=45)+
theme_phil()+
ylab("running total of sentiment")+
xlab("line_number")+
facet_wrap(document~.)
## Joining, by = "line_number"
## Warning: ggrepel: 45 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
ep5 %>%
get_sentences() %>%
sentiment_by(by = c('document', 'character', 'line_number', 'dialogue')) %>%
filter(word_count > 2) %>%
arrange(line_number) %>%
group_by(line_number) %>%
mutate(dialogue = as.character(dialogue)) %>%
summarize(sum_sentiment = sum(ave_sentiment)) %>%
inner_join(., ep5) %>%
mutate(running_sentiment = cumsum(sum_sentiment)) %>%
mutate(row_number = row_number()) %>%
mutate(show_negative = case_when(abs(sum_sentiment) > .6 ~ dialogue,
TRUE ~ "")) %>%
ggplot(., aes(x=row_number, y=running_sentiment, label = show_negative))+
geom_point(size=0.5)+
geom_line(linetype = 'dotted', lwd=0.5)+
geom_label_repel(size=1.5, max.overlaps=45)+
theme_phil()+
ylab("running total of sentiment")+
xlab("line_number")+
facet_wrap(document~.)
## Joining, by = "line_number"
## Warning: ggrepel: 21 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
ep6 %>%
get_sentences() %>%
sentiment_by(by = c('document', 'character', 'line_number', 'dialogue')) %>%
filter(word_count > 2) %>%
arrange(line_number) %>%
group_by(line_number) %>%
mutate(dialogue = as.character(dialogue)) %>%
summarize(sum_sentiment = sum(ave_sentiment)) %>%
inner_join(., ep6) %>%
mutate(running_sentiment = cumsum(sum_sentiment)) %>%
mutate(row_number = row_number()) %>%
mutate(show_negative = case_when(abs(sum_sentiment) > .6 ~ dialogue,
TRUE ~ "")) %>%
ggplot(., aes(x=row_number, y=running_sentiment, label = show_negative))+
geom_point(size=0.5)+
geom_line(linetype = 'dotted', lwd=0.5)+
geom_label_repel(size=1.5, max.overlaps=45)+
theme_phil()+
ylab("running total of sentiment")+
xlab("line_number")+
facet_wrap(document~.)
## Joining, by = "line_number"
## Warning: ggrepel: 15 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Palpy, just for John.
# without filtering word count
ep6 %>%
get_sentences() %>%
sentiment_by(by = c('document', 'character', 'line_number', 'dialogue')) %>%
filter(character == 'EMPEROR') %>%
filter(ave_sentiment != 0) %>%
#filter(word_count>2) %>%
arrange(line_number) %>%
mutate(dialogue = as.character(dialogue)) %>%
mutate(running_sentiment = cumsum(ave_sentiment)) %>%
mutate(row_number = row_number()) %>%
mutate(show_negative = case_when(abs(ave_sentiment) > .3 ~ dialogue,
TRUE ~ "")) %>%
ggplot(., aes(x=row_number, y=ave_sentiment, label = show_negative))+
geom_point(size=1)+
geom_line(linetype = 'dotted', lwd=0.9)+
geom_label_repel(size=3, max.overlaps=45)+
theme_phil()+
ylab("average sentiment")+
xlab("line_number")+
facet_wrap(character+document~.)+
geom_smooth(method = 'loess', formula = 'y ~ x')
# with the filter
ep6 %>%
get_sentences() %>%
sentiment_by(by = c('document', 'character', 'line_number', 'dialogue')) %>%
filter(character == 'EMPEROR') %>%
filter(ave_sentiment != 0) %>%
filter(word_count>2) %>%
arrange(line_number) %>%
mutate(dialogue = as.character(dialogue)) %>%
mutate(running_sentiment = cumsum(ave_sentiment)) %>%
mutate(row_number = row_number()) %>%
mutate(show_negative = case_when(abs(ave_sentiment) > .15 ~ dialogue,
TRUE ~ "")) %>%
ggplot(., aes(x=row_number, y=ave_sentiment, label = show_negative))+
geom_point(size=0.9)+
geom_line(linetype = 'dotted', lwd=0.5)+
geom_label_repel(size=1.5, max.overlaps=45)+
theme_phil()+
ylab("average sentiment")+
xlab("line_number")+
facet_wrap(character+document~.)+
geom_smooth(method = 'loess', formula = 'y ~ x')
Ackbar
ep6 %>%
get_sentences() %>%
sentiment_by(by = c('document', 'character', 'line_number', 'dialogue')) %>%
filter(character == 'ACKBAR') %>%
filter(ave_sentiment != 0) %>%
# filter(word_count>2) %>%
arrange(line_number) %>%
mutate(dialogue = as.character(dialogue)) %>%
mutate(running_sentiment = cumsum(ave_sentiment)) %>%
mutate(row_number = row_number()) %>%
mutate(show_negative = case_when(abs(ave_sentiment) > .1 ~ dialogue,
TRUE ~ "")) %>%
ggplot(., aes(x=row_number, y=ave_sentiment, label = show_negative))+
geom_point(size=0.5)+
geom_line(linetype = 'dotted', lwd=0.5)+
geom_label_repel(size=2.5, max.overlaps=45)+
theme_phil()+
ylab("average sentiment")+
xlab("line_number")+
facet_wrap(character+document~.)+
geom_smooth(method = 'loess', formula = 'y ~ x')
John request: plot the big wigs.
big_wigs<-c('LUKE', 'LEIA', 'HAN', 'BEN', 'VADER', 'YODA', 'THREEPIO', 'LANDO', 'EMPEROR')
set.seed(30)
palette = distinctColorPalette(length(big_wigs))
starwars %>%
mutate(character = case_when(document == 'the empire strikes back' & character == 'CREATURE' ~ 'YODA',
TRUE ~ character)) %>%
filter(character %in% big_wigs) %>%
get_sentences() %>%
sentiment_by(by = c('document', 'character', 'line_number', 'dialogue')) %>%
filter(ave_sentiment != 0) %>%
# filter(word_count>2) %>%
arrange(line_number) %>%
mutate(dialogue = as.character(dialogue)) %>%
mutate(document = factor(document, levels = c('a new hope', 'the empire strikes back', 'return of the jedi'))) %>%
group_by(document, character) %>%
mutate(running_sentiment = cumsum(ave_sentiment)) %>%
mutate(running_average = rollapply(ave_sentiment,2, mean,align='right',fill=NA)) %>%
mutate(row_number = row_number()) %>%
mutate(show_negative = case_when(abs(ave_sentiment) > .5 ~ dialogue,
TRUE ~ "")) %>%
mutate(max = case_when(row_number == max(row_number) ~ character,
TRUE ~ NA_character_)) %>%
ungroup() %>%
mutate(row_number = row_number()) %>%
ggplot(., aes(x=row_number, y=running_sentiment, group=character, color = character, label = max))+
# geom_point(size=0.25)+
geom_line(lwd=1.1)+
geom_label_repel(nudge_x = 1.5,
nudge_y = 1)+
# geom_label_repel(size=1.5, max.overlaps=150)+
theme_phil()+
facet_wrap(document~., ncol=1)+
ylab("running total of sentiment")+
xlab("line_number")+
scale_color_manual(values = palette)+
guides(color = "none")+
theme(panel.grid.major = element_line(size = 0.25))+
# theme(panel.grid.major=element_blank())+
geom_hline(yintercept = 0, linetype = 'dotted', col="grey60")
## Warning: Removed 1825 rows containing missing values (geom_label_repel).
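The `running_average` column above is a right-aligned rolling mean over pairs of sentences, so the first sentence for each character has no value. Here is a minimal base R sketch of that calculation on made-up scores; `stats::filter()` is used as a stand-in for `rollapply()` and should give the same numbers for this simple case:

```r
# Hypothetical sentiment scores, invented for illustration.
scores <- c(0.4, -0.2, 0.6, 0.0, -0.8)

# Right-aligned moving average with a window of 2: each value is the mean of
# the current score and the one before it. stats::filter() is called with its
# namespace because dplyr masks filter().
window <- 2
running_average <- as.numeric(
  stats::filter(scores, rep(1 / window, window), sides = 1)
)

# The first element is NA because there is no earlier score to average with.
print(running_average)
```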
# moving average
rolling_mean <- rollify(mean, window = 3)
# starwars %>%
# mutate(character = case_when(document == 'the empire strikes back' & character == 'CREATURE' ~ 'YODA',
# TRUE ~ character)) %>%
# filter(character %in% big_wigs) %>%
# get_sentences() %>%
# sentiment_by(by = c('document', 'character', 'line_number', 'dialogue')) %>%
# filter(ave_sentiment != 0) %>%
# # filter(word_count>2) %>%
# arrange(line_number) %>%
# mutate(dialogue = as.character(dialogue)) %>%
# mutate(document = factor(document, levels = c('a new hope', 'the empire strikes back', 'return of the jedi'))) %>%
# group_by(document, character) %>%
# mutate(running_sentiment = cumsum(ave_sentiment)) %>%
# mutate(running_average = rolling_mean(ave_sentiment)) %>%
# mutate(row_number = row_number()) %>%
# mutate(show_negative = case_when(abs(ave_sentiment) > .5 ~ dialogue,
# TRUE ~ "")) %>%
# mutate(max = case_when(row_number == max(row_number) ~ character,
# TRUE ~ NA_character_)) %>%
# ungroup() %>%
# mutate(row_number = row_number()) %>%
# ggplot(., aes(x=line_number, y=running_average, group=character, color = character, label = max))+
# # geom_point(size=0.25)+
# geom_line(lwd=1.1)+
# geom_label_repel(nudge_x = 1.5,
# nudge_y = 1)+
# # geom_label_repel(size=1.5, max.overlaps=150)+
# theme_phil()+
# facet_wrap(document~., ncol=1)+
# ylab("moving average of sentiment")+
# xlab("line_number")+
# scale_color_manual(values = palette)+
# guides(color = F)+
# theme(panel.grid.major = element_line(size = 0.25))+
# # theme(panel.grid.major=element_blank())+
# geom_hline(yintercept = 0, linetype = 'dotted', col="grey60")
#
If you just grab the average/sum sentiment, who comes out on top?
starwars %>%
get_sentences() %>%
sentiment_by(by = c('document', 'character', 'line_number', 'dialogue')) %>%
group_by(document, character) %>%
filter(ave_sentiment !=0) %>%
summarize(sentences = n_distinct(line_number),
avg_sentiment = mean(ave_sentiment),
sum_sentiment = sum(ave_sentiment)) %>%
filter(sentences > 5) %>%
# arrange(desc(sentences))
ggplot(., aes(x=avg_sentiment,
fill=avg_sentiment,
y=reorder_within(character, avg_sentiment, document)))+
geom_col()+
scale_y_reordered()+
facet_wrap(~document, ncol =1, scales="free_y")+
theme_phil()+
scale_fill_gradient2_tableau()+
guides(fill="none")+
theme(axis.text.y = element_text(size=8))+
ylab("")+
xlab("Average Sentiment")
## `summarise()` has grouped output by 'document'. You can override using the `.groups` argument.
starwars %>%
get_sentences() %>%
sentiment_by(by = c('document', 'character', 'line_number', 'dialogue')) %>%
group_by(document, character) %>%
mutate(document = factor(document, levels = c("a new hope", "the empire strikes back", "return of the jedi"))) %>%
filter(ave_sentiment !=0) %>%
summarize(sentences = n_distinct(line_number),
avg_sentiment = mean(ave_sentiment),
sum_sentiment = sum(ave_sentiment)) %>%
filter(sentences > 5) %>%
# arrange(desc(sentences))
ggplot(., aes(x=sum_sentiment,
fill=sum_sentiment,
y=reorder_within(character, sum_sentiment, document)))+
geom_col()+
scale_y_reordered()+
facet_wrap(~document, ncol =1, scales="free_y")+
theme_phil()+
# scale_fill_gradient2_tableau()+
guides(fill="none")+
theme(axis.text.y = element_text(size=8))+
ylab("")+
xlab("Summed Sentiment")
## `summarise()` has grouped output by 'document'. You can override using the `.groups` argument.
Let’s compute the tf-idf by movie.
starwars_tf_idf<-starwars %>%
unnest_tokens(word, dialogue) %>%
count(document, word, sort=T) %>%
left_join(., starwars %>%
unnest_tokens(word, dialogue) %>%
count(document, word, sort=T) %>%
group_by(document) %>%
summarize(total = sum(n))) %>%
bind_tf_idf(word, document, n) %>%
arrange(desc(tf_idf))
## Joining, by = "document"
starwars_tf_idf
starwars_tf_idf %>%
mutate(document = factor(document, levels = c('a new hope', 'the empire strikes back', 'return of the jedi'))) %>%
group_by(document) %>%
slice_max(tf_idf, n=20, with_ties = F) %>%
ungroup() %>%
ggplot(., aes(tf_idf, reorder_within(word, tf_idf, document), fill = document)) +
geom_col(show.legend = F) +
facet_wrap(~document, ncol=2, scales="free")+
theme_phil()+
labs(x="tf_idf", y="")+
theme(axis.text.y = element_text(size=8))+
scale_y_reordered()+
scale_fill_colorblind()
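To make the tf-idf weighting concrete, here is a small worked example in base R. The counts are invented, and the formula (term frequency times the natural log of documents over documents containing the word) is, to my understanding, what `bind_tf_idf()` computes; treat it as a sketch rather than the package's implementation.

```r
# Toy word counts per document; the numbers are invented for illustration.
counts <- data.frame(
  document = c("a", "a", "b", "b", "c", "c"),
  word     = c("force", "droid", "force", "hoth", "force", "ewok"),
  n        = c(3, 1, 2, 2, 4, 1),
  stringsAsFactors = FALSE
)

# Term frequency: a word's count divided by the total words in its document.
counts$tf <- counts$n / ave(counts$n, counts$document, FUN = sum)

# Inverse document frequency: log of (number of documents / documents containing the word).
# Each row is one (document, word) pair, so rows per word = documents containing it.
n_docs <- length(unique(counts$document))
docs_with_word <- ave(counts$n, counts$word, FUN = length)
counts$idf <- log(n_docs / docs_with_word)

counts$tf_idf <- counts$tf * counts$idf
print(counts)
```

A word that shows up in every document ("force" here) gets an idf of zero, so it drops out entirely. That is why the per-movie charts are dominated by words specific to one film.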
starwars_bigrams<-starwars %>%
unnest_tokens(bigram, dialogue, token = "ngrams", n = 2)
bigrams_separated<-starwars_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
# new bigram counts:
bigram_counts <- bigrams_filtered %>%
count(word1, word2, sort = TRUE)
bigram_counts %>%
filter(!is.na(word1))
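The bigram steps above pair each word with the one that follows it, then drop any pair where either word is a stop word. A minimal base R sketch of the same idea on a single sample sentence, with a tiny hypothetical stop list standing in for `stop_words`:

```r
# A sample sentence; the stop list below is a made-up stand-in for stop_words.
line <- "I have a bad feeling about this"
words <- strsplit(tolower(line), "\\s+")[[1]]

# Pair every word with its successor to form bigrams.
bigrams <- data.frame(
  word1 = head(words, -1),
  word2 = tail(words, -1),
  stringsAsFactors = FALSE
)

# Drop a bigram if either half is a stop word.
stop_list <- c("i", "have", "a", "about", "this")
kept <- bigrams[!bigrams$word1 %in% stop_list & !bigrams$word2 %in% stop_list, ]
print(kept)
```

Seven words give six bigrams, and only one survives the filter, which is why the filtered counts are so much smaller than the raw ones.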
library(igraph)
bigram_graph <- bigram_counts %>%
filter(!is.na(word1)) %>%
filter(n > 2) %>%
graph_from_data_frame()
library(ggraph)
## Warning: package 'ggraph' was built under R version 4.1.1
set.seed(32)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
Let’s apply topic modeling to Star Wars. This requires converting the tokenized dialogue to a document-term matrix, where each line of dialogue is treated as its own document.
library(tm)
## Warning: package 'tm' was built under R version 4.1.1
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:fabletools':
##
## features
## The following object is masked from 'package:ggplot2':
##
## annotate
library(topicmodels)
## Warning: package 'topicmodels' was built under R version 4.1.1
starwars_dtm<-starwars %>%
unnest_tokens(word, dialogue) %>%
anti_join(stop_words) %>%
count(document, line_number, word, sort=T) %>%
mutate(document_line_number = paste(document, line_number, sep="_")) %>%
cast_dtm(document_line_number, word, n)
## Joining, by = "word"
starwars_dtm %>%
tidy() %>%
head()
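For intuition, a document-term matrix has one row per document (here, one line of dialogue), one column per word, and counts in the cells. The sketch below builds a dense toy version with base R's `table()`; the tokens are invented, and `cast_dtm()` produces a sparse object rather than this plain table.

```r
# Invented tokens for two lines of dialogue, keyed the same way as document_line_number.
tokens <- data.frame(
  document_line_number = c("a new hope_1", "a new hope_1",
                           "a new hope_2", "a new hope_2", "a new hope_2"),
  word = c("force", "droid", "force", "force", "ship"),
  stringsAsFactors = FALSE
)

# Cross-tabulate documents against words: rows are documents, columns are terms.
dtm <- table(tokens$document_line_number, tokens$word)
print(dtm)
```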
starwars_lda <- LDA(starwars_dtm, k = 4, control = list(seed = 1234))
starwars_lda
## A LDA_VEM topic model with 4 topics.
Now let's open up the LDA model and look at the top terms in each topic by per-topic word probability (beta).
tidy(starwars_lda, matrix="beta") %>%
group_by(topic) %>%
slice_max(beta, n=25) %>%
ungroup() %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()+
theme_phil()
Per-document (line of dialogue) topic probabilities (gamma), showing the ten lines most strongly associated with each topic.
# topic 1
tidy(starwars_lda, matrix="gamma") %>%
mutate(topic = paste0("topic", topic)) %>%
pivot_wider(names_from = topic, values_from = gamma) %>%
arrange(document) %>%
left_join(., starwars %>%
mutate(document = paste(document, line_number, sep="_"))) %>%
select(document, starts_with('topic'), character, dialogue) %>%
arrange(desc(topic1)) %>%
mutate_if(is.numeric, round, 3) %>%
head(10) %>%
flextable() %>%
flextable::autofit()
## Joining, by = "document"
document | topic1 | topic2 | topic3 | topic4 | character | dialogue |
return of the jedi_40 | 0.976 | 0.008 | 0.008 | 0.008 | LUKE | Greetings, Exalted One. Allow me to introduce myself. I am Luke Skywalker, Jedi Knight and friend to Captain Solo. I know that you are powerful, mighty Jabba, and that your anger with Solo must be equally powerful. I seek an audience with Your Greatness to bargain for Solo's life. With your wisdom, I'm sure that we can work out an arrangement which will be mutually beneficial and enable us to avoid any unpleasant confrontation. As a token of my goodwill, I present to you a gift: these two droids. |
the empire strikes back_480 | 0.971 | 0.010 | 0.010 | 0.010 | YODA | Run! Yes. A Jedi's strength flows from the Force. But beware of the dark side. Anger...fear...aggression. The dark side of the Force are they. Easily they flow, quick to join you in a fight. If once you start down the dark path, forever will it dominate your destiny, consume you it will, as it did Obi-Wan's apprentice. |
a new hope_264 | 0.964 | 0.012 | 0.012 | 0.012 | BEN | A young Jedi named Darth Vader, who was a pupil of mine until he turned to evil, helped the Empire hunt down and destroy the Jedi Knights. He betrayed and murdered your father. Now the Jedi are all but extinct. Vader was seduced by the dark side of the Force. |
return of the jedi_245 | 0.963 | 0.012 | 0.013 | 0.012 | BEN | To be a Jedi, Luke, you must confront and then go beyond the dark side - the side your father couldn't get past. Impatience is the easiest door - for you, like your father. Only, your father was seduced by what he found on the other side of the door, and you have held firm. You're no longer so reckless now, Luke. You are strong and patient. And now, you must face Darth Vader again! |
return of the jedi_225 | 0.962 | 0.012 | 0.012 | 0.013 | YODA | Luke...Luke...Do not...Do not underestimate the powers of the Emperor, or suffer your father's fate, you will. Luke, when gone am I , the last of the Jedi will you be. Luke, the Force runs strong in your family. Pass on what you have learned, Luke... There is... another...Sky...Sky...walker. |
the empire strikes back_637 | 0.956 | 0.015 | 0.015 | 0.015 | YODA | Stopped they must be. On this depends. Only a fully trained Jedi Knight with the Force as his ally will conquer Vader and his Emperor. If you end your training now, if you choose the quick and easy path, as Vader did, you will become an agent of evil. |
return of the jedi_224 | 0.952 | 0.016 | 0.016 | 0.016 | YODA | Remember, a Jedi's strength flows from the Force. But beware. Anger, fear, aggression. The dark side are they. Once you start down the dark path, forever will it dominate your destiny. |
the empire strikes back_779 | 0.951 | 0.017 | 0.016 | 0.016 | VADER | There is no escape. Don't make me destroy you. You do not yet realize your importance. You have only begun to discover you power. Join me and I will complete your training. With our combined strength, we can end this destructive conflict and bring order to the galaxy. |
a new hope_236 | 0.945 | 0.018 | 0.018 | 0.018 | LUKE | Oh, this little droid! I think he's searching for his former master... I've never seen such devotion in a droid before... there seems tobe no stopping him. He claims to be the property of an Obi-Wan Kenobi. Is he a relative of yours? Do you know who he's talking about? |
a new hope_441 | 0.937 | 0.021 | 0.021 | 0.021 | BEN | I felt a great disturbance in the Force... as if millions of voices suddenly cried out in terror and were suddenly silenced. I fear something terrible has happened. |
# topic 2
tidy(starwars_lda, matrix="gamma") %>%
mutate(topic = paste0("topic", topic)) %>%
pivot_wider(names_from = topic, values_from = gamma) %>%
arrange(document) %>%
left_join(., starwars %>%
mutate(document = paste(document, line_number, sep="_"))) %>%
select(document, starts_with('topic'), character, dialogue) %>%
arrange(desc(topic2)) %>%
mutate_if(is.numeric, round, 3) %>%
head(10) %>%
flextable() %>%
flextable::autofit()
## Joining, by = "document"
document | topic1 | topic2 | topic3 | topic4 | character | dialogue |
a new hope_33 | 0.020 | 0.941 | 0.020 | 0.020 | BIGGS | Of course I got it. Signed aboard The Rand Ecliptic last week. First mate Biggs Darklighter at your service...... I just came back to say goodbye to all you unfortunate landlocked simpletons. |
the empire strikes back_304 | 0.025 | 0.925 | 0.025 | 0.025 | THREEPIO | Sir, the possibility of successfully navigating an asteroid field is approximately three thousand, seven hundred and twenty to one. |
a new hope_106 | 0.027 | 0.918 | 0.027 | 0.027 | THREEPIO | Vaporators! Sir - My first job was programming binary load lifters... very similar to your vaporators. You could say... |
return of the jedi_545 | 0.027 | 0.916 | 0.028 | 0.029 | ACKBAR | Take evasive action! Green Group, stick close to holding sector MV-7. |
the empire strikes back_543 | 0.031 | 0.908 | 0.030 | 0.031 | HAN | The fleet is beginning to break up. Go back and stand by the manual release for the landing claw. |
return of the jedi_59 | 0.034 | 0.906 | 0.030 | 0.030 | NINEDENINE | You're a feisty little one, but you'll soon learn some respect. I have need for you on the master's Sail Barge. And I think you'll fit in nicely. |
a new hope_648 | 0.034 | 0.898 | 0.034 | 0.034 | HAN | Oh! The garbage chute was a really wonderful idea. What an incredible smell you've discovered! Let's get out of here! Get away from there... |
the empire strikes back_26 | 0.034 | 0.898 | 0.034 | 0.034 | HAN | Well, the bounty hunter we ran into on Ord Mantell changed my mind. |
the empire strikes back_406 | 0.034 | 0.897 | 0.034 | 0.035 | THREEPIO | Sir, sir! I've isolated the reverse power flux coupling. |
return of the jedi_309 | 0.034 | 0.896 | 0.035 | 0.034 | CONTROLLER | Shuttle Tydirium, transmit the clearance code for shield passage. |
# topic 3
tidy(starwars_lda, matrix="gamma") %>%
mutate(topic = paste0("topic", topic)) %>%
pivot_wider(names_from = topic, values_from = gamma) %>%
arrange(document) %>%
left_join(., starwars %>%
mutate(document = paste(document, line_number, sep="_"))) %>%
select(document, starts_with('topic'), character, dialogue) %>%
arrange(desc(topic3)) %>%
mutate_if(is.numeric, round, 3) %>%
head(10) %>%
flextable() %>%
flextable::autofit()
## Joining, by = "document"
document | topic1 | topic2 | topic3 | topic4 | character | dialogue |
a new hope_398 | 0.017 | 0.017 | 0.949 | 0.017 | JABBA | Han, Han! If only you hadn't had to dump that shipment of spice... you understand I just can't make an exception. Where would I be if every pilot who smuggled for me dumped their shipment at the first sign of an Imperial starship? It's not good business. |
a new hope_412 | 0.021 | 0.023 | 0.934 | 0.022 | HAN | It looks like an Imperial cruiser. Our passengers must be hotter than I thought. Try and hold them off. Angle the deflector shield while I make the calculations for the jump to light speed. |
a new hope_415 | 0.025 | 0.025 | 0.924 | 0.027 | HAN | Watch your mouth, kid, or you're going to find yourself floating home. We'll be safe enough once we make the jump to hyperspace. Besides, I know a few maneuvers. We'll lose them! |
the empire strikes back_691 | 0.027 | 0.027 | 0.918 | 0.027 | THREEPIO | Oh, yes, that's very good. I like that. Oh! Something's not right because now I can't see. Wait. Wait! Oh, my! what have you done? I'm backwards, you stupid furball. Only an overgrown mophead like you would be stupid enough... |
the empire strikes back_547 | 0.027 | 0.029 | 0.914 | 0.029 | HAN | Well, if they follow standard Imperial procedure, they'll dump their garbage before they go to light-speed, then we just float away. |
a new hope_809 | 0.030 | 0.031 | 0.909 | 0.030 | LUKE | It's not impossible. I used to bullseye womp rats in my T-sixteen back home. They're not much bigger than two meters. |
a new hope_49 | 0.031 | 0.031 | 0.908 | 0.030 | VADER | Leave that to me. Send a distress signal and then inform the senate that all aboard were killed! |
a new hope_306 | 0.030 | 0.031 | 0.908 | 0.030 | BEN | Mos Eisley Spaceport. You will never find a more wretched hive of scum and villainy. We must be cautious. |
a new hope_370 | 0.031 | 0.030 | 0.908 | 0.031 | GREEDO | Jabba's through with you. He has no time for smugglers who drop their shipments at the first sign of an Imperial cruiser. |
return of the jedi_627 | 0.033 | 0.030 | 0.905 | 0.032 | HAN/PILOT | It's over, Commander. The Rebels have been routed. They're fleeing into the woods. We need reinforcements to continue the pursuit. |
# topic 4
tidy(starwars_lda, matrix="gamma") %>%
mutate(topic = paste0("topic", topic)) %>%
pivot_wider(names_from = topic, values_from = gamma) %>%
arrange(document) %>%
left_join(., starwars %>%
mutate(document = paste(document, line_number, sep="_"))) %>%
select(document, starts_with('topic'), character, dialogue) %>%
arrange(desc(topic4)) %>%
mutate_if(is.numeric, round, 3) %>%
head(10) %>%
flextable() %>%
flextable::autofit()
## Joining, by = "document"
document | topic1 | topic2 | topic3 | topic4 | character | dialogue |
return of the jedi_255 | 0.006 | 0.006 | 0.006 | 0.982 | BEN | The Organa household was high-born and politically quite powerful in that system. Leia became a princess by virtue of lineage... no one knew she'd been adopted, of course. But it was a title without real power, since Alderaan had long been a democracy. Even so, the family continued to be politically powerful, and Leia, following in her foster father's path, became a senator as well. That's not all she became, of course... she became the leader of her cell in the Alliance against the corrupt Empire. And because she had diplomatic immunity, she was a vital link for getting information to the Rebel cause. That's what she was doing when her path crossed yours... for her foster parents had always told her to contact me on Tatooine, if her troubles became desperate. |
return of the jedi_266 | 0.007 | 0.007 | 0.007 | 0.979 | ACKBAR | You can see here the Death Star orbiting the forest Moon of Endor. Although the weapon systems on this Death Star are not yet operational, the Death Star does have a strong defense mechanism. It is protected by an energy shield, which is generated from the nearby forest Moon of Endor. The shield must be deactivated if any attack is to be attempted. Once the shield is down, our cruisers will create a perimeter, while the fighters fly into the superstructure and attempt to knock out the main reactor. |
a new hope_66 | 0.012 | 0.012 | 0.013 | 0.963 | LUKE | ... so I cut off my power, shut down the afterburners and came in low on Deak's trail. I was so close I thought I was going to fry my instruments. As it was I busted up the Skyhopper pretty bad. Uncle Owen was pretty upset. He grounded me for the rest of the season. You should have been there... it was fantastic. |
a new hope_803 | 0.015 | 0.015 | 0.014 | 0.956 | DODONNA | The battle station is heavily shielded and carries a firepower greater than half the star fleet.Its defenses are designed around a direct large-scale assault. A small one-man fighter should be able to penetrate the outer defense. |
a new hope_420 | 0.017 | 0.016 | 0.017 | 0.950 | HAN | Traveling through hyperspace isn't like dusting crops, boy! Without precise calculations we could fly right through a star or bounce too close to a supernova and that'd end your trip real quick, wouldn't it? |
a new hope_472 | 0.017 | 0.016 | 0.017 | 0.949 | OFFICER CASS | Our scout ships have reached Dantooine. They found the remains of a Rebel base, but they estimate that it has been deserted for some time. They are now conducting an extensive search of the surrounding systems. |
the empire strikes back_174 | 0.017 | 0.018 | 0.017 | 0.948 | LEIA | The ion cannon will fire several shots to make sure that any enemy ships will be out of your flight path. When you've gotten past the energy shield, proceed directly to the rendezvous point. Understood? |
a new hope_805 | 0.018 | 0.018 | 0.018 | 0.945 | DODONNA | Well, the Empire doesn't consider a small one-man fighter to be any threat, or they'd have a tighter defense. An analysis of the plans provided by Princess Leia has demonstrated a weakness in the battle station. |
a new hope_288 | 0.019 | 0.018 | 0.018 | 0.944 | MOTTI | Any attack made by the Rebels against this station would be a useless gesture, no matter what technical data they've obtained. This station is now the ultimate power in the universe. I suggest we use it! |
a new hope_347 | 0.021 | 0.022 | 0.021 | 0.936 | HAN | I've outrun Imperial starships, not the local bulk-cruisers, mind you. I'm talking about the big Corellian ships now. She's fast enough for you, old man. What's the cargo? |
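Each row of gamma is a probability distribution over the four topics, so a line can be assigned to its most likely topic by taking the largest value in its row. A minimal sketch using the top row from each of the four tables above (the values are the rounded ones shown, so row sums are only approximately 1):

```r
# Gamma values copied from the first row of each table above (rounded to 3 places).
gamma <- matrix(
  c(0.976, 0.008, 0.008, 0.008,
    0.020, 0.941, 0.020, 0.020,
    0.017, 0.017, 0.949, 0.017,
    0.006, 0.006, 0.006, 0.982),
  ncol = 4, byrow = TRUE,
  dimnames = list(
    c("return of the jedi_40", "a new hope_33",
      "a new hope_398", "return of the jedi_255"),
    paste0("topic", 1:4)
  )
)

# Most likely topic per line: the column holding the row maximum.
top_topic <- colnames(gamma)[max.col(gamma, ties.method = "first")]
names(top_topic) <- rownames(gamma)
print(top_topic)

# Each row should sum to roughly 1, up to rounding.
print(rowSums(gamma))
```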